Take-Home Exercise 3
pacman::p_load(sf, sfdep, tmap, tidyverse, knitr)
Overview and Objectives
In this take-home exercise, my aim is to use the necessary R packages to perform:
Global Measures of Spatial Autocorrelation
Local Measures of Spatial Autocorrelation
This will be done on district-level data covering the different types of crimes in Malaysia, which we will layer with Malaysia's income inequality data.
This exercise also serves to prototype the Shiny application UI and to choose the right type of components.
Data
Income Inequality Data: Household income inequality by district (https://data.gov.my/data-catalogue/hh_inequality_district)
Annual Principal Labour Force Statistics by District: Annual principal labour force statistics including unemployment and participation rates (https://data.gov.my/data-catalogue/lfs_district)
Crime Data: Crime rates by district (https://data.gov.my/data-catalogue/crime_district)
Malaysia - Subnational Administrative Boundaries: (https://data.humdata.org/dataset/cod-ab-mys?)
Packages
sf provides a standardised way to work with spatial vector data (points, lines, polygons)
sfdep provides an sf-friendly interface to spatial dependence statistics, building on spdep for spatial econometrics and spatial statistics
tmap creates thematic maps
tidyverse for easy data manipulation and some visualisation
knitr facilitates the integration of R code and documentation in reproducible research reports
Importing the Data
Before the UI prototyping can be done, let's see what type of data we are dealing with so that we can better plan the UI components to be used.
Let’s import the crime dataset
crime <- read_csv("data/aspatial/crime_district.csv")
Rows: 19152 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): state, district, category, type
dbl (1): crimes
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let's import the income inequality data as well
income <- read_csv("data/aspatial/hh_inequality.csv")
Rows: 318 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, district
dbl (1): gini
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Also import the annual principal labour force statistics
labour <- read_csv("data/aspatial/lfs_district.csv")
Rows: 560 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, district
dbl (7): lf, lf_employed, lf_unemployed, lf_outside, p_rate, u_rate, ep_ratio
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
crime
# A tibble: 19,152 × 6
state district category type date crimes
<chr> <chr> <chr> <chr> <date> <dbl>
1 Malaysia All assault all 2016-01-01 22327
2 Malaysia All assault all 2017-01-01 21366
3 Malaysia All assault all 2018-01-01 16902
4 Malaysia All assault all 2019-01-01 16489
5 Malaysia All assault all 2020-01-01 13279
6 Malaysia All assault all 2021-01-01 11495
7 Malaysia All assault all 2022-01-01 10348
8 Malaysia All assault all 2023-01-01 10453
9 Malaysia All assault causing_injury 2016-01-01 5531
10 Malaysia All assault causing_injury 2017-01-01 5024
# ℹ 19,142 more rows
From here we can identify some of the variables that the user can interact with: district, category, type, date, crimes.
income
# A tibble: 318 × 4
state district date gini
<chr> <chr> <date> <dbl>
1 Johor Batu Pahat 2019-01-01 0.295
2 Johor Batu Pahat 2022-01-01 0.338
3 Johor Johor Bahru 2019-01-01 0.388
4 Johor Johor Bahru 2022-01-01 0.359
5 Johor Kluang 2019-01-01 0.333
6 Johor Kluang 2022-01-01 0.354
7 Johor Kota Tinggi 2019-01-01 0.361
8 Johor Kota Tinggi 2022-01-01 0.343
9 Johor Kulai 2019-01-01 0.324
10 Johor Kulai 2022-01-01 0.337
# ℹ 308 more rows
Likewise for income we have district, date, gini
labour
# A tibble: 560 × 10
state district date lf lf_employed lf_unemployed lf_outside p_rate
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Johor Batu Pahat 2019-01-01 214. 210. 3.7 90.4 70.3
2 Johor Batu Pahat 2020-01-01 219. 214. 4.7 92.2 70.4
3 Johor Batu Pahat 2021-01-01 216 211 5 97.1 69
4 Johor Batu Pahat 2022-01-01 221. 217. 4 94.6 70
5 Johor Johor Bah… 2019-01-01 792. 768. 24.6 294. 73
6 Johor Johor Bah… 2020-01-01 804. 771. 32.5 297. 73
7 Johor Johor Bah… 2021-01-01 806. 772. 34.1 298. 73
8 Johor Johor Bah… 2022-01-01 830. 799 30.5 293. 73.9
9 Johor Kluang 2019-01-01 166. 160. 5.6 66.9 71.3
10 Johor Kluang 2020-01-01 168. 161. 7.1 67.4 71.4
# ℹ 550 more rows
# ℹ 2 more variables: u_rate <dbl>, ep_ratio <dbl>
For labour we have district, date, lf, lf_employed, lf_unemployed, lf_outside, p_rate, u_rate, ep_ratio.
UI Design
For a Shiny application in this course we work with three main components: headerPanel, sidebarPanel, and mainPanel.
Header Panel: This is the topmost part of the UI, where we can put a description of the application or a navbar to navigate between pages. Each page leads to another group member's part of this project.
Sidebar Panel: This panel mainly consists of the input controls that the user can adjust to change the map output in the Main Panel.
Main Panel: This is the primary area of the application and typically contains outputs. The main panel displays the output (maps, plots, tables, etc.) based on the input given in the sidebar panel.
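A minimal sketch of this three-part layout, assuming placeholder input and output IDs (not the final design), could look like:

```r
library(shiny)
library(tmap)

# Sketch only: headerPanel / sidebarPanel / mainPanel wired together;
# all IDs and choices below are placeholders, not the final app
ui <- fluidPage(
  headerPanel("Crime and Income Inequality in Malaysia"),
  sidebarLayout(
    sidebarPanel(
      # input controls go here
      selectInput("category", "Crime category",
                  choices = c("assault", "property"))
    ),
    mainPanel(
      # outputs (maps, plots, tables) rendered from the inputs
      tmapOutput("lisa_map")
    )
  )
)
```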

Header Panel
For this we will use navbarPage(), which Shiny provides. This keeps our project organised and makes it easier to navigate through the different pages we will have.
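As a sketch, navbarPage() could give each group member's section its own page (page titles here are illustrative placeholders):

```r
library(shiny)

# Each tabPanel is one page of the project; titles are placeholders
ui <- navbarPage(
  "Crime Analysis App",
  tabPanel("Overview"),
  tabPanel("Global & Local Moran's I"),
  tabPanel("Emerging Hot Spot Analysis")
)
```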

Side Panel
This part holds the input controls, which follow from the variables and data types we have identified: district, category, type, date, crimes, gini.
Some of the potential input controls that could be used are:
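As a sketch, the identified variables map naturally onto standard Shiny controls (the choices and ranges shown are illustrative, not final):

```r
library(shiny)

# Illustrative input controls for the identified variables;
# district/type choices would be populated from the data at runtime
sidebarPanel(
  selectInput("district", "District", choices = NULL),
  selectInput("category", "Crime category",
              choices = c("assault", "property")),
  selectInput("type", "Crime type", choices = NULL),
  sliderInput("year", "Year", min = 2019, max = 2022,
              value = 2019, sep = "")
)
```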







Below is something our side panel could look like, given the variables we have:


Another possibility is to present the filters as cards for the different sections of the plot, e.g. global and local Moran's I would be grouped together in their own card, and EHSA would get its own card.
Main Panel
Given that I am working with LISA maps, having a side-by-side comparison between two maps would be helpful for the user to visualise.

Another thing to consider is a tabset, in case I want to showcase more maps; this makes it easier for the user to navigate.
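A tabset for the main panel could be sketched as follows, with one tab per map (tab titles and output IDs are placeholders):

```r
library(shiny)
library(tmap)

# One tab per map output; IDs are placeholders for the eventual renderTmap() outputs
mainPanel(
  tabsetPanel(
    tabPanel("Local Moran's I", tmapOutput("lisa_map")),
    tabPanel("p-values", tmapOutput("pval_map")),
    tabPanel("EHSA", tmapOutput("ehsa_map"))
  )
)
```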

This is also roughly how our Shiny application would look with the different layouts.
Data Wrangling
Looking at the crime CSV file, there are rows with "all" or "All" as values. These appear to be summaries across crime types or across districts for the different years, so let's remove them.
excluded_column <- "date"
crime <- crime[!apply(crime[, !names(crime) %in% excluded_column] == "all", 1, any), ]
crime <- crime[!apply(crime[, !names(crime) %in% excluded_column] == "All", 1, any), ]
Let's also add a column called year to each CSV dataset, so that it is easier to split up the data by year.
crime <- crime %>%
mutate(year = year(date))
income <- income %>%
mutate(year = year(date))
labour <- labour %>%
mutate(year = year(date))
Let's load the Malaysia shapefile and transform the CRS to Malaysia's projected coordinate system (EPSG:3168).
msia_sf <- read_sf(dsn = "data/geospatial/mys_adm_unhcr_20210211_shp",
layer = "mys_admbnda_adm2_unhcr_20210211") %>%
st_as_sf(coords =c(
"longitude", "latitude"),
crs = 4326) %>%
st_transform(crs = 3168)
st_crs(msia_sf)
Coordinate Reference System:
User input: EPSG:3168
wkt:
PROJCRS["Kertau (RSO) / RSO Malaya (m)",
BASEGEOGCRS["Kertau (RSO)",
DATUM["Kertau (RSO)",
ELLIPSOID["Everest 1830 (RSO 1969)",6377295.664,300.8017,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4751]],
CONVERSION["Rectified Skew Orthomorphic Malaya Grid (metre)",
METHOD["Hotine Oblique Mercator (variant A)",
ID["EPSG",9812]],
PARAMETER["Latitude of projection centre",4,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8811]],
PARAMETER["Longitude of projection centre",102.25,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8812]],
PARAMETER["Azimuth of initial line",323.0257905,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8813]],
PARAMETER["Angle from Rectified to Skew Grid",323.130102361111,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8814]],
PARAMETER["Scale factor on initial line",0.99984,
SCALEUNIT["unity",1],
ID["EPSG",8815]],
PARAMETER["False easting",804670.24,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["Malaysia - West Malaysia onshore."],
BBOX[1.21,99.59,6.72,104.6]],
ID["EPSG",3168]]
Hole in boundary file
Next, check if there are any holes in the boundary file
u_msia <- st_union(msia_sf)
plot(u_msia)
Missing row
Let’s do a check if there are any missing values in the crime data
na <- crime %>%
summarise(na_district = sum(is.na(district)),
na_category = sum(is.na(category)),
na_type = sum(is.na(type)),
na_date = sum(is.na(date)),
na_crimes = sum(is.na(crimes))
)
print(na)
# A tibble: 1 × 5
na_district na_category na_type na_date na_crimes
<int> <int> <int> <int> <int>
1 0 0 0 0 0
Let’s also do a check for the income inequality data
na <- income %>%
summarise(na_district = sum(is.na(district)),
na_date = sum(is.na(date)),
na_gini = sum(is.na(gini))
)
print(na)
# A tibble: 1 × 3
na_district na_date na_gini
<int> <int> <int>
1 0 0 0
And also for the labour data
na <- labour %>%
summarise(na_district = sum(is.na(district)),
na_date = sum(is.na(date)),
na_lf = sum(is.na(lf)),
na_lf_unemployed = sum(is.na(lf_unemployed)),
na_u_rate = sum(is.na(u_rate)),
)
print(na)
# A tibble: 1 × 5
na_district na_date na_lf na_lf_unemployed na_u_rate
<int> <int> <int> <int> <int>
1 0 0 0 0 0
Left Join
Mismatch Districts
Having checked everything else, let's check whether there are any issues between msia_sf and crime
combined_data <- bind_cols(crime = sort(unique(crime$district)), msia_sf = sort(unique(msia_sf$ADM2_EN)))
# Create a new column to compare the values
combined_data <- combined_data %>%
mutate(same_values = crime == msia_sf) %>% filter(same_values == FALSE)
# View the result
combined_data
This generates an error because the number of districts differs: crime has 159 districts while msia_sf has 144.
Let’s run another code to see the difference
crime_unique <- data.frame(district = sort(unique(crime$district)))
msia_unique <- data.frame(ADM2_EN = sort(unique(msia_sf$ADM2_EN)))
# Find rows in crime_unique that don't have a match in msia_unique
unmatched_crime <- anti_join(crime_unique, msia_unique, by = c("district" = "ADM2_EN"))
# Find rows in msia_unique that don't have a match in crime_unique
unmatched_msia <- anti_join(msia_unique, crime_unique, by = c("ADM2_EN" = "district"))
# Combine results to see all mismatches
unmatched_crime
 district
1 Ampang Jaya
2 Arau
3 Bandar Bharu
4 Batu Gajah
5 Brickfields
6 Cameron Highland
7 Cheras
8 Dang Wangi
9 Gerik
10 Hulu Selangor
11 Ipoh
12 Iskandar Puteri
13 Johor Bahru Selatan
14 Johor Bahru Utara
15 Kajang
16 Kangar
17 Klang Selatan
18 Klang Utara
19 Kota Kinabatangan
20 Kota Samarahan
21 Kuala Lipis
22 Manjung
23 Matu Daro
24 Nilai
25 Nusajaya
26 Padang Besar
27 Padawan
28 Pengkalan Hulu
29 Petaling Jaya
30 Seberang Perai Selatan
31 Seberang Perai Tengah
32 Seberang Perai Utara
33 Selama
34 Sentul
35 Serdang
36 Seri Alam
37 Sg. Buloh
38 Shah Alam
39 Subang Jaya
40 Sungai Buloh
41 Sungai Siput
42 Taiping
43 Tanjong Malim
44 Tapah
45 Wangsa Maju
unmatched_msia
 ADM2_EN
1 Asajaya
2 Batang Padang
3 Daro
4 Johor Bahru
5 Kinabatangan
6 Kinta
7 Klang
8 Kuala Penyu
9 Larut Dan Matang
10 Lipis
11 Manjung (Dinding)
12 Matu
13 Nabawan
14 Pakan
15 Perlis
16 Petaling
17 Pitas
18 Pokok Sena
19 Putatan
20 S.P. Tengah
21 S.P. Utara
22 S.P.Selatan
23 Samarahan
24 Selangau
25 Tambunan
26 Tongod
27 Ulu Langat
28 Ulu Perak
29 Ulu Selangor
30 WP. Kuala Lumpur
From here we can actually see which data is missing in which file
Let’s see all the unique districts in the sf file
sort(unique(msia_sf$ADM2_EN))
 [1] "Alor Gajah" "Asajaya" "Bachok"
[4] "Baling" "Bandar Baharu" "Barat Daya"
[7] "Batang Padang" "Batu Pahat" "Bau"
[10] "Beaufort" "Belaga" "Beluran"
[13] "Bentong" "Bera" "Besut"
[16] "Betong" "Bintulu" "Cameron Highlands"
[19] "Dalat" "Daro" "Dungun"
[22] "Gombak" "Gua Musang" "Hilir Perak"
[25] "Hulu Terengganu" "Jasin" "Jelebu"
[28] "Jeli" "Jempol" "Jerantut"
[31] "Johor Bahru" "Julau" "Kampar"
[34] "Kanowit" "Kapit" "Kemaman"
[37] "Keningau" "Kerian" "Kinabatangan"
[40] "Kinta" "Klang" "Kluang"
[43] "Kota Belud" "Kota Bharu" "Kota Kinabalu"
[46] "Kota Marudu" "Kota Setar" "Kota Tinggi"
[49] "Kuala Kangsar" "Kuala Krai" "Kuala Langat"
[52] "Kuala Muda" "Kuala Penyu" "Kuala Pilah"
[55] "Kuala Selangor" "Kuala Terengganu" "Kuantan"
[58] "Kubang Pasu" "Kuching" "Kudat"
[61] "Kulaijaya" "Kulim" "Kunak"
[64] "Lahad Datu" "Langkawi" "Larut Dan Matang"
[67] "Lawas" "Ledang" "Limbang"
[70] "Lipis" "Lubok Antu" "Lundu"
[73] "Machang" "Manjung (Dinding)" "Maran"
[76] "Marang" "Marudi" "Matu"
[79] "Melaka Tengah" "Meradong" "Mersing"
[82] "Miri" "Muar" "Mukah"
[85] "Nabawan" "Padang Terap" "Pakan"
[88] "Papar" "Pasir Mas" "Pasir Puteh"
[91] "Pekan" "Penampang" "Pendang"
[94] "Perak Tengah" "Perlis" "Petaling"
[97] "Pitas" "Pokok Sena" "Pontian"
[100] "Port Dickson" "Putatan" "Ranau"
[103] "Raub" "Rembau" "Rompin"
[106] "S.P. Tengah" "S.P. Utara" "S.P.Selatan"
[109] "Sabak Bernam" "Samarahan" "Sandakan"
[112] "Saratok" "Sarikei" "Segamat"
[115] "Selangau" "Semporna" "Sepang"
[118] "Seremban" "Serian" "Setiu"
[121] "Sibu" "Sik" "Simunjan"
[124] "Sipitang" "Song" "Sri Aman"
[127] "Tambunan" "Tampin" "Tanah Merah"
[130] "Tatau" "Tawau" "Temerloh"
[133] "Tenom" "Timur Laut" "Tongod"
[136] "Tuaran" "Tumpat" "Ulu Langat"
[139] "Ulu Perak" "Ulu Selangor" "W.P. Labuan"
[142] "W.P. Putrajaya" "WP. Kuala Lumpur" "Yan"
There is no easy way to fix this except to look up the districts mentioned in crime and map each as closely as possible to a district in the sf file.
crime <- crime %>%
mutate(district = recode(district,
# Johor Bahru mappings
"Iskandar Puteri" = "Johor Bahru",
"Nusajaya" = "Johor Bahru",
"Johor Bahru Selatan" = "Johor Bahru",
"Johor Bahru Utara" = "Johor Bahru",
"Seri Alam" = "Johor Bahru",
# Bandar Baharu correction
"Bandar Bharu" = "Bandar Baharu",
# WP Kuala Lumpur mappings
"Brickfields" = "WP. Kuala Lumpur",
"Cheras" = "WP. Kuala Lumpur",
"Dang Wangi" = "WP. Kuala Lumpur",
"Sentul" = "WP. Kuala Lumpur",
"Wangsa Maju" = "WP. Kuala Lumpur",
# Seremban correction
"Nilai" = "Seremban",
# Seberang Perai corrections
"Seberang Perai Selatan" = "S.P.Selatan",
"Seberang Perai Tengah" = "S.P. Tengah",
"Seberang Perai Utara" = "S.P. Utara",
# Cameron Highlands correction
"Cameron Highland" = "Cameron Highlands",
# Lipis correction
"Kuala Lipis" = "Lipis",
# Kinta mappings
"Batu Gajah" = "Kinta",
"Ipoh" = "Kinta",
# Ulu Perak mappings
"Gerik" = "Ulu Perak",
"Pengkalan Hulu" = "Ulu Perak",
# Manjung correction
"Manjung" = "Manjung (Dinding)",
# Larut Dan Matang mappings
"Selama" = "Larut Dan Matang",
"Taiping" = "Larut Dan Matang",
# Kuala Kangsar correction
"Sungai Siput" = "Kuala Kangsar",
# Batang Padang mappings
"Tanjong Malim" = "Batang Padang",
"Tapah" = "Batang Padang",
# Perlis mappings
"Arau" = "Perlis",
"Kangar" = "Perlis",
"Padang Besar" = "Perlis",
# Kinabatangan correction
"Kota Kinabatangan" = "Kinabatangan",
# Samarahan correction
"Kota Samarahan" = "Samarahan",
# Mukah correction
"Matu Daro" = "Mukah",
# Kuching correction
"Padawan" = "Kuching",
# Gombak correction
"Ampang Jaya" = "Gombak",
# Ulu Langat correction
"Kajang" = "Ulu Langat",
# Ulu Selangor correction
"Hulu Selangor" = "Ulu Selangor",
# Klang mappings
"Klang Selatan" = "Klang",
"Klang Utara" = "Klang",
# Petaling mappings
"Petaling Jaya" = "Petaling",
"Serdang" = "Petaling",
"Sg. Buloh" = "Petaling",
"Shah Alam" = "Petaling",
"Subang Jaya" = "Petaling",
"Sungai Buloh" = "Petaling",
# Default to keep original name if no match
.default = district))
Let's check again to see if we altered the data correctly.
crime_unique <- data.frame(district = sort(unique(crime$district)))
# Find rows in crime_unique that don't have a match in msia_unique
unmatched_crime <- anti_join(crime_unique, msia_unique, by = c("district" = "ADM2_EN"))
unmatched_crime
[1] district
<0 rows> (or 0-length row.names)
As we plan to overlay with the labour data, let’s do checks for that as well
labour_unique <- data.frame(district = sort(unique(labour$district)))
msia_unique <- data.frame(ADM2_EN = sort(unique(msia_sf$ADM2_EN)))
# Find rows in crime_unique that don't have a match in msia_unique
unmatched_labour <- anti_join(labour_unique, msia_unique, by = c("district" = "ADM2_EN"))
# Find rows in msia_unique that don't have a match in crime_unique
unmatched_msia <- anti_join(msia_unique, labour_unique, by = c("ADM2_EN" = "district"))
# Combine results to see all mismatches
unmatched_labour district
1 Hulu Perak
2 Kulai
3 Manjung
4 Maradong
5 Seberang Perai Selatan
6 Seberang Perai Tengah
7 Seberang Perai Utara
8 Tangkak
unmatched_msia ADM2_EN
1 Kulaijaya
2 Ledang
3 Manjung (Dinding)
4 Meradong
5 Perlis
6 S.P. Tengah
7 S.P. Utara
8 S.P.Selatan
9 Ulu Perak
10 W.P. Labuan
11 W.P. Putrajaya
12 WP. Kuala Lumpur
Let’s change the districts in labour like what we did for crime
labour <- labour %>%
mutate(district = recode(district,
"Kulai" = "Kulaijaya",
# Seberang Perai corrections
"Seberang Perai Selatan" = "S.P.Selatan",
"Seberang Perai Tengah" = "S.P. Tengah",
"Seberang Perai Utara" = "S.P. Utara",
# Ulu Perak mappings
"Hulu Perak" = "Ulu Perak",
# Manjung correction
"Manjung" = "Manjung (Dinding)",
"Maradong" = "Meradong",
"Tangkak" = "Ledang",
# Default to keep original name if no match
.default = district))
Let's check if there are still any issues with the districts for labour.
labour_unique <- data.frame(district = sort(unique(labour$district)))
msia_unique <- data.frame(ADM2_EN = sort(unique(msia_sf$ADM2_EN)))
# Find rows in crime_unique that don't have a match in msia_unique
unmatched_labour <- anti_join(labour_unique, msia_unique, by = c("district" = "ADM2_EN"))
unmatched_labour
[1] district
<0 rows> (or 0-length row.names)
Let’s combine our labour data with our crimes data
crime_labour <- crime %>%
filter(year >= 2019 & year <= 2022) %>%
left_join(labour, by = c("district","year")) %>%
select(1:4,6,7,10,12,15)
Let's check for any empty rows after the left join.
na <- crime_labour %>%
summarise(na_district = sum(is.na(district)),
na_category = sum(is.na(category)),
na_type = sum(is.na(type)),
na_crimes = sum(is.na(crimes)),
na_year = sum(is.na(year)),
na_lf = sum(is.na(lf)),
na_lf_unemployed = sum(is.na(lf_unemployed)),
na_u_rate = sum(is.na(u_rate)),
)
print(na)
# A tibble: 1 × 8
na_district na_category na_type na_crimes na_year na_lf na_lf_unemployed
<int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 480 480
# ℹ 1 more variable: na_u_rate <int>
There are NA values so let’s remove them
crime_labour <- na.omit(crime_labour)
Do another check.
na <- crime_labour %>%
summarise(na_district = sum(is.na(district)),
na_category = sum(is.na(category)),
na_type = sum(is.na(type)),
na_crimes = sum(is.na(crimes)),
na_year = sum(is.na(year)),
na_lf = sum(is.na(lf)),
na_lf_unemployed = sum(is.na(lf_unemployed)),
na_u_rate = sum(is.na(u_rate)),
)
print(na)
# A tibble: 1 × 8
na_district na_category na_type na_crimes na_year na_lf na_lf_unemployed
<int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 0 0
# ℹ 1 more variable: na_u_rate <int>
Finally, let's combine it with our msia_sf
msia <- left_join(msia_sf,crime_labour, by = c("ADM2_EN" = "district")) %>%
select(1,6,16:23)
msia
Simple feature collection with 7024 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 184853.1 ymin: 94420.8 xmax: 2380932 ymax: 829136
Projected CRS: Kertau (RSO) / RSO Malaya (m)
# A tibble: 7,024 × 11
ADM2_EN ADM1_EN state.x category type crimes year lf lf_unemployed
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Batu Pahat Johor Johor assault causing… 41 2019 214. 3.7
2 Batu Pahat Johor Johor assault causing… 43 2020 219. 4.7
3 Batu Pahat Johor Johor assault causing… 22 2021 216 5
4 Batu Pahat Johor Johor assault causing… 19 2022 221. 4
5 Batu Pahat Johor Johor assault murder 3 2019 214. 3.7
6 Batu Pahat Johor Johor assault murder 3 2020 219. 4.7
7 Batu Pahat Johor Johor assault murder 0 2021 216 5
8 Batu Pahat Johor Johor assault murder 3 2022 221. 4
9 Batu Pahat Johor Johor assault rape 29 2019 214. 3.7
10 Batu Pahat Johor Johor assault rape 16 2020 219. 4.7
# ℹ 7,014 more rows
# ℹ 2 more variables: u_rate <dbl>, geometry <MULTIPOLYGON [m]>
NA Values
Looking at this, we can see that some additional rows have been added. Let's see if there are any NA values.
na <- msia %>%
summarise(na_district = sum(is.na(ADM2_EN)),
na_category = sum(is.na(category)),
na_type = sum(is.na(type)),
na_crimes = sum(is.na(crimes)),
na_year = sum(is.na(year)),
na_lf = sum(is.na(lf)),
na_lf_unemployed = sum(is.na(lf_unemployed)),
na_u_rate = sum(is.na(u_rate)),
)
print(na)
Simple feature collection with 1 feature and 8 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 184853.1 ymin: 94420.8 xmax: 2380932 ymax: 829136
Projected CRS: Kertau (RSO) / RSO Malaya (m)
# A tibble: 1 × 9
na_district na_category na_type na_crimes na_year na_lf na_lf_unemployed
<int> <int> <int> <int> <int> <int> <int>
1 0 16 16 16 16 16 16
# ℹ 2 more variables: na_u_rate <int>, geometry <MULTIPOLYGON [m]>
Let’s remove the NA rows
msia <- na.omit(msia)
Do another check.
na <- msia %>%
summarise(na_district = sum(is.na(ADM2_EN)),
na_category = sum(is.na(category)),
na_type = sum(is.na(type)),
na_crimes = sum(is.na(crimes)),
na_year = sum(is.na(year)),
na_lf = sum(is.na(lf)),
na_lf_unemployed = sum(is.na(lf_unemployed)),
na_u_rate = sum(is.na(u_rate)),
)
print(na)
Simple feature collection with 1 feature and 8 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 184853.1 ymin: 94420.8 xmax: 2380932 ymax: 829136
Projected CRS: Kertau (RSO) / RSO Malaya (m)
# A tibble: 1 × 9
na_district na_category na_type na_crimes na_year na_lf na_lf_unemployed
<int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 0 0
# ℹ 2 more variables: na_u_rate <int>, geometry <MULTIPOLYGON [m]>
Let’s check for duplicates as well
duplicates <- msia %>%
group_by(ADM2_EN, year, category, type, crimes) %>%
filter(n() > 1)
if(nrow(duplicates) > 0) {
print("Duplicate combinations found!")
print(duplicates)
}
[1] "Duplicate combinations found!"
Simple feature collection with 288 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 284518.8 ymin: 116154.7 xmax: 1641930 ymax: 656488.5
Projected CRS: Kertau (RSO) / RSO Malaya (m)
# A tibble: 288 × 11
# Groups: ADM2_EN, year, category, type, crimes [125]
ADM2_EN ADM1_EN state.x category type crimes year lf lf_unemployed
* <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Johor Bahru Johor Johor assault murder 3 2020 804. 32.5
2 Johor Bahru Johor Johor assault murder 3 2022 830. 30.5
3 Johor Bahru Johor Johor assault robber… 0 2019 792. 24.6
4 Johor Bahru Johor Johor assault robber… 0 2020 804. 32.5
5 Johor Bahru Johor Johor assault robber… 0 2022 830. 30.5
6 Johor Bahru Johor Johor assault robber… 0 2019 792. 24.6
7 Johor Bahru Johor Johor assault robber… 0 2020 804. 32.5
8 Johor Bahru Johor Johor assault robber… 0 2021 806. 34.1
9 Johor Bahru Johor Johor assault robber… 0 2022 830. 30.5
10 Johor Bahru Johor Johor property theft_… 5 2022 830. 30.5
# ℹ 278 more rows
# ℹ 2 more variables: u_rate <dbl>, geometry <MULTIPOLYGON [m]>
Global Measures of Spatial Autocorrelation
Calculating Neighbours and Weights
I will define neighbours based on Queen contiguity, and then assign spatial weights to each neighbouring polygon.
msia_nb_q <- st_contiguity(msia, queen=TRUE)
As this takes time to compute, let's write it to an RDS file.
write_rds(msia_nb_q, "data/rds/msia_nb_q.rds")
Computing Row-Standardised Weight Matrix
Next, we need to assign spatial weights to each neighboring polygon.
The st_weights() function from the sfdep package can be used to supplement a neighbour list with spatial weights.
msia_wm_rs <- st_weights(msia_nb_q, style="W")
We will mutate the newly created neighbour list object msia_nb_q and weight matrix msia_wm_rs into our existing msia. The result is a new object, which we will call wm_q.
wm_q <- msia %>%
mutate(nb = msia_nb_q,
wt = msia_wm_rs,
.before = 1)
Global Moran's I Test
We use the Global Moran's I test to assess spatial autocorrelation in our dataset, i.e. whether the number of crimes in a district forms spatial clusters.
global_moran_test(wm_q$crimes,
wm_q$nb,
wm_q$wt,
zero.policy = TRUE,
na.action=na.omit)
Moran I test under randomisation
data: x
weights: listw
Moran I statistic standard deviate = 118.83, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
1.240600e-01 -1.427144e-04 1.092536e-06
From the test, the positive Moran's I statistic suggests that there is clustering, i.e. a degree of positive spatial autocorrelation. This might be expected: after committing a crime, an offender might move to a neighbouring area to commit another crime, so as to be harder to track.
We can also see that the p-value is small. From a frequentist standpoint, this result is unlikely to have occurred by chance.
To strengthen our findings, we run a monte-carlo simulation.
set.seed(2932)
Global Moran's I permutation test
global_moran_perm(wm_q$crimes,
wm_q$nb,
wm_q$wt,
zero.policy = TRUE,
nsim = 99,
na.action=na.omit)
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 100
statistic = 0.12406, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
From the output above, the Moran's I statistic after 99 permutations is 0.12406 with a p-value < 2.2e-16, consistent with our previous result; the low p-value suggests the clustering did not occur randomly.
Local Moran I
Local Indicators of Spatial Association, or LISA, let us evaluate clusters between districts, where higher values denote that a district is more heavily influenced by its surroundings.
Calculating Local Moran I
We calculate local Moran's I statistics and append the results to the original dataframe as new columns.
wm_q <- wm_q %>%
mutate(local_moran = local_moran(
crimes, nb, wt, nsim = 99, zero.policy=TRUE),
.before = 1) %>%
unnest(local_moran)
Visualising Local Moran I
lisa <- wm_q
map1 <- tm_shape(lisa) +
tm_fill("ii") +
tm_borders(alpha = 0.5) +
tm_view(set.zoom.limits = c(6,8)) +
tm_layout(main.title = "local Moran's I of No of crimes",
main.title.size = 0.8)
map2 <- tm_shape(lisa) +
tm_fill("p_ii",
breaks = c(0, 0.001, 0.01, 0.05, 1),
labels = c("0.001", "0.01", "0.05", "Not sig")) +
tm_borders(alpha = 0.5) +
tm_layout(main.title = "p-value of local Moran's I",
main.title.size = 0.8)
tmap_arrange(map1, map2, ncol = 2)
Variable(s) "ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

LISA
The local indicator of spatial association (LISA) for each observation gives an indication of the extent of significant spatial clustering of similar values around that observation. A LISA map is a categorical map showing the type of outliers and clusters. There are two types of outliers, namely High-Low and Low-High outliers. Likewise, there are two types of clusters, namely High-High and Low-Low clusters.
High-Low Outliers: Districts with a high value of crimes, surrounded by neighbouring districts with low values of crimes.
Low-High Outliers: Districts with a low value of crimes, surrounded by neighbouring districts with high values of crimes.
High-High Clusters: Districts with a high value of crimes, surrounded by neighbouring districts with high values of crimes.
Low-Low Clusters: Districts with a low value of crimes, surrounded by neighbouring districts with low values of crimes.
lisa_sig <- lisa %>% filter(p_ii_sim < 0.05)
tm_shape(lisa) +
tm_polygons() +
tm_borders(alpha = 0.5) +
tm_shape(lisa_sig) +
tm_fill("mean", title = "LISA class") +
tm_borders(alpha = 0.4) +
tm_layout(main.title = "LISA map of crimes", main.title.size = 1)
Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).

Emerging Hot Spot Analysis
Performing Emerging Hot Spot Analysis
With EHSA we can, and likely should, incorporate the time lag of our spatial neighbours. Secondly, there are classifications proposed by ESRI which help us understand how each location is changing over time. Both of these are handled by the emerging_hotspot_analysis() function.
emerging_hotspot_analysis() takes a spacetime object x and, at minimum, the quoted name of the variable of interest in .var. We can specify the number of time lags using the argument k, which is set to 1 by default.
For this, let's create a data frame without the geometry, keeping only the category assault and the type causing_injury.
category_to_select <- c(
"assault"
)
type_to_select <- c(
"causing_injury"
)
msia_df <- msia %>% filter(category %in% category_to_select, type %in% type_to_select )
msia_df <- msia_df %>%
select(year, crimes, ADM2_EN) %>%
st_drop_geometry()
msia_sf_filtered <- msia_sf %>%
semi_join(msia_df, by = "ADM2_EN")
Next is to create a spacetime object.
msia_spt <- spacetime(msia_df, msia_sf,
.loc_col = "ADM2_EN",
.time_col = "year")
Let's check if it is indeed a spacetime cube.
is_spacetime_cube(msia_spt)
! Number of rows does not equal `n time-periods x n locations`
[1] FALSE
msia_df <- msia_df %>%
complete(year = unique(year),
ADM2_EN = unique(ADM2_EN),
fill = list(crimes = 0)) # Fill missing values with 0 or NA as needed
print(paste("Number of years:", length(unique(msia_df$year))))
[1] "Number of years: 4"
print(paste("Number of locations:", length(unique(msia_df$ADM2_EN))))
[1] "Number of locations: 128"
print(paste("Total rows:", nrow(msia_df)))
[1] "Total rows: 584"
For our spacetime cube we are supposed to get 128 * 4 = 512 rows, but we are getting 584 rows, so there could be duplicates.
duplicates <- msia_df %>%
group_by(year, ADM2_EN) %>%
filter(n() > 1)
if(nrow(duplicates) > 0) {
print("Duplicate combinations found!")
print(duplicates)
}
[1] "Duplicate combinations found!"
# A tibble: 120 × 3
# Groups: year, ADM2_EN [48]
year ADM2_EN crimes
<dbl> <chr> <dbl>
1 2019 Batang Padang 11
2 2019 Batang Padang 14
3 2019 Gombak 47
4 2019 Gombak 117
5 2019 Johor Bahru 46
6 2019 Johor Bahru 126
7 2019 Johor Bahru 140
8 2019 Johor Bahru 0
9 2019 Johor Bahru 50
10 2019 Kinta 3
# ℹ 110 more rows
Let’s remove them
msia_df <- msia_df %>%
group_by(year, ADM2_EN) %>%
summarise(crimes = mean(crimes, na.rm = TRUE)) %>%
ungroup()
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
check_duplicates <- msia_df %>%
group_by(year, ADM2_EN) %>%
filter(n() > 1)
print("Number of remaining duplicates:")
[1] "Number of remaining duplicates:"
print(nrow(check_duplicates))
[1] 0
Let’s create the spacetime object again
msia_spt <- spacetime(msia_df, msia_sf,
.loc_col = "ADM2_EN",
.time_col = "year")
is_spacetime_cube(msia_spt)
! Number of rows does not equal `n time-periods x n locations`
[1] FALSE
Another possible issue is that msia_sf has more locations than msia_df. Let's check.
n_locations_df <- length(unique(msia_df$ADM2_EN))
n_locations_sf <- length(unique(msia_sf$ADM2_EN))
print(n_locations_df)[1] 128
print(n_locations_sf)[1] 144
This is indeed the case, so let’s filter msia_sf to keep only the districts present in msia_df
msia_sf_filtered <- msia_sf %>%
semi_join(msia_df, by = "ADM2_EN")
n_locations_df <- length(unique(msia_df$ADM2_EN))
n_locations_sf <- length(unique(msia_sf_filtered$ADM2_EN))
print(n_locations_df)[1] 128
print(n_locations_sf)[1] 128
Create the spacetime object again
msia_spt <- spacetime(msia_df, msia_sf_filtered,
.loc_col = "ADM2_EN",
.time_col = "year")
is_spacetime_cube(msia_spt)
[1] TRUE
We will perform EHSA analysis by using emerging_hotspot_analysis() of the sfdep package. It takes the spacetime object msia_spt and the name of the variable of interest, crimes, for the .var argument.
ehsa <- emerging_hotspot_analysis(
x = msia_spt,
.var = "crimes",
k = 1,
nsim = 99
)
Warning in spdep::poly2nb(geometry, queen = queen, ...): some observations have no neighbours;
if this seems unexpected, try increasing the snap argument.
Warning in spdep::poly2nb(geometry, queen = queen, ...): neighbour object has 4 sub-graphs;
if this sub-graph count seems unexpected, try increasing the snap argument.
We can then join them together
ehsa_sf <- left_join(msia_sf, ehsa, by = c("ADM2_EN" = "location"))
Visualisation with Bar Graph
ggplot(data = ehsa,
aes(y = classification,fill = classification)) +
geom_bar(show.legend = FALSE)
Visualising Distribution of EHSA (85% Confidence)
EHSA_sig <- ehsa_sf %>%
filter(p_value < 0.15)
tm_shape(ehsa_sf) +
tm_polygons() +
tm_borders(alpha = 0.5) +
tm_shape(EHSA_sig) +
tm_fill(col = "classification", title = "Classification") +
tm_borders(alpha = 0.4) +
tm_layout(main.title = "EHSA (>85%)", main.title.size = 1)
Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).

Testing for Shiny App
All of the code above works well for the basics of Global and Local Spatial Autocorrelation and Emerging Hot Spot Analysis. In a Shiny application, however, the input is more dynamic: users choose their own inputs to alter the map and make their own interpretation. The code also takes a while to run because of the size of the dataset.
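One way to mitigate the running time (my own sketch, not part of the exercise code) is to cache expensive intermediate results, keyed by the user’s filter selections, so that repeated selections do not recompute them. A minimal base-R memoisation sketch, with slow_step() standing in for an expensive call such as st_contiguity():

```r
# Minimal caching sketch (assumption: keys such as "2019_assault" are
# built from the user's filter selections; slow_step() is a stand-in).
cache <- new.env()

get_or_compute <- function(key, compute_fn) {
  if (!exists(key, envir = cache)) {
    assign(key, compute_fn(), envir = cache)  # compute once, store result
  }
  get(key, envir = cache)                     # reuse on later requests
}

slow_step <- function(x) { Sys.sleep(0.1); x^2 }  # pretend-expensive call
first  <- get_or_compute("2019_assault", function() slow_step(4))  # computed
second <- get_or_compute("2019_assault", function() slow_step(4))  # cached
```

In the actual Shiny app, shiny’s built-in bindCache() on a reactive expression would achieve the same effect more idiomatically.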
Filtering for Global and Local Spatial Correlation
Global Spatial
As a Shiny application is interactive and the user chooses the inputs, using it is essentially applying a filter on the dataset.
For msia we have
ADM2_EN: we can limit this to all districts, or let the user select a mixture of districts; however that would not make much sense, given that spatial autocorrelation is about clustering across neighbouring areas
category: there is only assault or property, so the user can select either one, or both
type: shown based on the chosen category; the user can select all types or a mixture
year: it would be best to limit the user to one year, or at most two, given how long the code already takes to run for 4 years
p_ii: this can be adjusted with a slider for the confidence level, where 95% confidence corresponds to p < 0.05 and 85% confidence to p < 0.15, to show the strength of evidence
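The filters above could map onto Shiny input components roughly as follows. This is a sketch only: the input IDs, labels, and choice values are hypothetical placeholders, not from an actual app.

```r
library(shiny)

# Sketch of the filter sidebar (hypothetical IDs and choices).
filter_ui <- sidebarPanel(
  selectInput("category", "Crime category",
              choices = c("assault", "property"),
              multiple = TRUE),
  selectInput("type", "Crime type",
              choices = c("causing_injury"),   # would be updated server-side
              multiple = TRUE),                # based on the chosen category
  selectInput("year", "Year",
              choices = c("2019", "2020", "2021", "2022"),
              multiple = TRUE),                # validate at most 2 server-side
  sliderInput("conf", "Confidence level",
              min = 0.80, max = 0.99, value = 0.95, step = 0.01)
)
```

Server-side, the slider value would be converted to a p-value cutoff with 1 - input$conf before filtering on p_ii.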
Something of what the filter would look like
#in this case no filter is applied for ADM2_EN, it already contains all
ADM2_EN_to_select <- c("all")
category_to_select <- c("assault")
type_to_select <- c("causing_injury")
year_to_select <- c("2019")
However, for this test I would like to filter on year only, to get the worst-case running time.
msia_filtered <- msia %>% filter(year %in% year_to_select)
Afterwards we can perform our usual global and local measures.
msia_filtered_nb_q <- st_contiguity(msia_filtered, queen=TRUE)
This significantly cuts down the timing, but it is still far longer than what a user would be willing to wait for: it runs for about 45 seconds.
Limiting the category selection to one of the two might help.
msia_filtered <- msia %>% filter(year %in% year_to_select, category %in% category_to_select)
msia_filtered_nb_q <- st_contiguity(msia_filtered, queen=TRUE)
Warning in spdep::poly2nb(geometry, queen = queen, ...): neighbour object has 4 sub-graphs;
if this sub-graph count seems unexpected, try increasing the snap argument.
This cuts down the timing by quite a fair bit, but will it affect the output?
msia_filtered_wm_rs <- st_weights(msia_filtered_nb_q, style="W")
wm_q_filtered <- msia_filtered %>%
mutate(nb = msia_filtered_nb_q,
wt = msia_filtered_wm_rs,
.before = 1)
global_moran_test(wm_q_filtered$crimes,
wm_q_filtered$nb,
wm_q_filtered$wt,
zero.policy = TRUE)
Moran I test under randomisation
data: x
weights: listw
Moran I statistic standard deviate = 19.464, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
1.391760e-01 -9.794319e-04 5.185241e-05
From the test, the positive Moran’s I statistic suggests that there is clustering, i.e. a degree of positive spatial autocorrelation.
We can also see that the p-value is very small.
We can run monte-carlo simulation as well
global_moran_perm(wm_q_filtered$crimes,
wm_q_filtered$nb,
wm_q_filtered$wt,
zero.policy = TRUE,
nsim = 999,
na.action=na.omit)
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 1000
statistic = 0.13918, observed rank = 1000, p-value < 2.2e-16
alternative hypothesis: two.sided
The simulated values are about the same as those from the analytical test.
Local Spatial
We can do Local Moran I as well
wm_q_filtered <- wm_q_filtered %>%
mutate(local_moran = local_moran(
crimes, nb, wt, nsim = 999, zero.policy=TRUE),
.before = 1) %>%
unnest(local_moran)
So the above works with our filters applied. One thing to consider, however, is whether to let the user select both categories or only one. A single category cuts the processing time considerably, but more data portrays the big picture better. The extra cost in time is not large: it would still run in under a minute.
Testing different types of plots
Different map arrangement
lisa <- wm_q_filtered
map1 <- tm_shape(lisa) +
tm_fill("ii") +
tm_borders(alpha = 0.5) +
tm_view(set.zoom.limits = c(6,8)) +
tm_layout(main.title = "local Moran's I of No of crimes",
main.title.size = 0.8)
map2 <- tm_shape(lisa) +
tm_fill("p_ii",
breaks = c(0, 0.001, 0.01, 0.05, 1),
labels = c("< 0.001", "< 0.01", "< 0.05", "Not sig")) +
tm_borders(alpha = 0.5) +
tm_layout(main.title = "p-value of local Moran's I",
main.title.size = 0.8)
tmap_arrange(map1, map2, ncol = 2)
Variable(s) "ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

tm_shape(lisa) +
tm_fill("p_ii",
breaks = c(0, 0.001, 0.01, 0.05, 1),
labels = c("< 0.001", "< 0.01", "< 0.05", "Not sig")) +
tm_borders(alpha = 0.5) +
tm_text("u_rate", size = 0.4, col = "black") + # add u_rate as a text label
tm_layout(main.title = "p-values of Two Variables",
main.title.size = 0.8)
This plot is one way to display p_ii together with u_rate, to see whether there is any relation between the unemployment rate and the strong clustering.
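Beyond eyeballing the map, the hinted relation could be checked numerically, for instance by comparing mean unemployment rates between significant and non-significant districts. A sketch on toy data (an assumption for illustration; the real call would use the p_ii and u_rate columns of the lisa object):

```r
# Toy data standing in for the lisa attribute table (hypothetical values).
set.seed(42)
toy <- data.frame(p_ii   = runif(100),         # fake local Moran p-values
                  u_rate = runif(100, 2, 8))   # fake unemployment rates (%)

sig <- toy$p_ii < 0.05                         # significant local Moran's I
diff_means <- mean(toy$u_rate[sig]) - mean(toy$u_rate[!sig])
diff_means                                     # difference in mean u_rate
```

A difference near zero would suggest no obvious link between local clustering significance and unemployment; a formal test (e.g. t.test()) could follow.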
lisa <- wm_q_filtered
map1 <- tm_shape(lisa) +
tm_fill("p_ii",
breaks = c(0, 0.001, 0.01, 0.05, 1),
labels = c("< 0.001", "< 0.01", "< 0.05", "Not sig")) +
tm_borders(alpha = 0.5) +
tm_view(set.zoom.limits = c(6,8)) +
tm_layout(main.title = "p-value of local Moran's I of No of crimes",
main.title.size = 0.8)
map2 <- tm_shape(lisa) +
tm_fill("u_rate") +
tm_borders(alpha = 0.5) +
tm_layout(main.title = "unemployment rate",
main.title.size = 0.8)
tmap_arrange(map1, map2, ncol = 1)
Having two maps, filled with p_ii and u_rate respectively, and arranged top and bottom might be easier to read
lisa <- wm_q_filtered
map1 <- tm_shape(lisa) +
tm_fill("p_ii",
breaks = c(0, 0.001, 0.01, 0.05, 1),
labels = c("< 0.001", "< 0.01", "< 0.05", "Not sig")) +
tm_borders(alpha = 0.5) +
tm_view(set.zoom.limits = c(6,8)) +
tm_layout(main.title = "p-value of local Moran's I of No of crimes",
main.title.size = 0.8)
map2 <- tm_shape(lisa) +
tm_polygons() +
tm_text("u_rate", size = 0.4, col = "black") +
tm_layout(main.title = "unemployment rate",
main.title.size = 0.8)
tmap_arrange(map1, map2, ncol = 1)
Having the text itself seems to be the better choice, as it makes the actual value clear. The top-and-bottom arrangement, however, does not seem best. Ultimately the first plot, with some adjustment of the text, seems to be the best option.
LISA Map
lisa <- wm_q_filtered
lisa_sig <- lisa %>% filter(p_ii_sim < 0.05)
tm_shape(lisa) +
tm_polygons() +
tm_borders(alpha = 0.5) +
tm_shape(lisa_sig) +
tm_fill("mean", title = "LISA class") +
tm_text("u_rate", size = 0.5, col = "black") +
tm_borders(alpha = 0.4) +
tm_layout(main.title = "LISA map of crimes", main.title.size = 1)
Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).

Testing with different figure size
lisa <- wm_q_filtered
lisa_sig <- lisa %>% filter(p_ii_sim < 0.05)
tm_shape(lisa) +
tm_polygons() +
tm_borders(alpha = 0.5) +
tm_shape(lisa_sig) +
tm_fill("mean", title = "LISA class") +
tm_text("u_rate", size = 0.6, col = "black") +
tm_borders(alpha = 0.4) +
tm_layout(main.title = "LISA map of crimes", main.title.size = 1)
Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).

lisa <- wm_q_filtered
lisa_sig <- lisa %>% filter(p_ii_sim < 0.05)
tm_shape(lisa) +
tm_polygons() +
tm_borders(alpha = 0.5) +
tm_shape(lisa_sig) +
tm_fill("mean", title = "LISA class") +
tm_text("u_rate", size = 0.8, col = "black") +
tm_borders(alpha = 0.4) +
tm_layout(main.title = "LISA map of crimes", main.title.size = 1)
Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).

Testing interactive maps
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(lisa) +
tm_fill("p_ii",
breaks = c(0, 0.001, 0.01, 0.05, 1),
labels = c("< 0.001", "< 0.01", "< 0.05", "Not sig")) +
tm_borders(alpha = 0.5) +
tm_text("u_rate", size = 0.6, col = "black") + # add u_rate as a text label
tm_layout(main.title = "p-values of Two Variables",
main.title.size = 0.8)